A place for everything
Rodney J. Dyer
A vector is a storage container for data of a single, uniform type.
A vector contains values of the same data type, and each element can be accessed using a numerical index enclosed in square brackets, [ and ].
Because x is a vector AND it contains numeric data, the introspection functions is.vector() and is.numeric() will both return TRUE.
The data in x are both a vector and of numeric type.
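A minimal sketch of the points above, assuming a hypothetical numeric vector named x (the values are illustrative and not taken from the original):

```r
# x is a hypothetical numeric vector; the values are made up for illustration
x <- c(2.5, 4, 6.5, 8)

# Square brackets pull out elements by position (R counts from 1)
first  <- x[1]      # the first element
middle <- x[2:3]    # the second and third elements

# Introspection: x is both a vector and numeric
is.vector(x)
is.numeric(x)
```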
As long as every element has exactly the same base data type, a vector will store the values unchanged.
You CANNOT mix data types in a single vector and keep each value's original type. R will coerce every element to the most general type present (for example, numbers become character strings when mixed with text) so that all elements share one type.
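Coercion is easiest to see by checking the class of a mixed vector. The sketch below uses made-up values and hypothetical variable names:

```r
# Mixing numbers and text: every element becomes a character string
mixed <- c(1, "two", 3)
class(mixed)    # "character"

# Mixing logical and numeric: TRUE and FALSE become 1 and 0
flags <- c(TRUE, 2, FALSE)
class(flags)    # "numeric"
```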
Sometimes it is helpful to make a sequence of values in a vector. R has some built-in functionality for that.
Data within vectors can be subjected to unary operators.
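A few of the built-in ways to build sequences, shown here as a sketch with hypothetical variable names:

```r
# The colon operator makes consecutive integers
ints <- 1:5                               # 1 2 3 4 5

# seq() gives control over the step size or the number of values
steps  <- seq(0, 1, by = 0.25)            # 0.00 0.25 0.50 0.75 1.00
spaced <- seq(10, 20, length.out = 3)     # 10 15 20

# rep() repeats values
repeated <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
```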
As well as binary operators.
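As a brief illustration, assuming two hypothetical vectors x and y of equal length, unary operators act on each element of one vector and binary operators pair up the elements of two vectors:

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Unary operators take a single operand
negated  <- -x                  # -1 -2 -3
inverted <- !c(TRUE, FALSE)     # FALSE TRUE

# Binary operators work element by element
total   <- x + y                # 11 22 33
product <- x * y                # 10 40 90
```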
If you apply a binary operator to two vectors of different lengths, R will recycle the values in the shorter one.
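Recycling can be sketched with two hypothetical vectors where the longer length is a multiple of the shorter:

```r
long_vec  <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

# short_vec is reused as 10, 20, 10, 20 to match the length of long_vec
recycled <- long_vec + short_vec    # 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still recycles the values but issues a warning.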
For some mathematical operations, we need to work with matrices. A matrix is another ‘general’ container, but with dimensions for rows and columns of data.
Matrices are filled columnwise by default; if you want them filled rowwise, you have to ask for it.
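The difference in fill order can be seen in a small sketch (variable names are hypothetical):

```r
# Default: values fill down the columns
by_col <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE: values fill across the rows
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```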
Just like vectors, the square brackets are used to access values within a matrix. However, there are now two indices, one for the row and one for the column.
You can get an entire row or column using what is called a slice index.
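A short sketch of both kinds of indexing, using a hypothetical matrix m:

```r
m <- matrix(1:6, nrow = 2)    # filled columnwise: columns are (1,2), (3,4), (5,6)

# A single value: row first, then column
one_value <- m[2, 3]          # 6

# Slices: leave one index empty to take the whole row or column
first_row  <- m[1, ]          # 1 3 5
second_col <- m[, 2]          # 3 4
```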
Arithmetic operators on matrices work the same way (as long as the matrices have the same number of rows and columns).
This is element-wise multiplication (also known as the Hadamard product), not the Kronecker product.
Matrix multiplication is slightly more involved: each entry of the result is the sum of the products of a row of the first matrix with a column of the second.
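The two kinds of multiplication can be compared side by side. This sketch assumes two hypothetical 2 x 2 matrices, A and B, and uses R's * operator for element-wise multiplication and %*% for matrix multiplication:

```r
A <- matrix(1:4, nrow = 2)    # rows are (1, 3) and (2, 4)
B <- matrix(5:8, nrow = 2)    # rows are (5, 7) and (6, 8)

# Element-wise: each entry of A times the matching entry of B
elementwise <- A * B
#      [,1] [,2]
# [1,]    5   21
# [2,]   12   32

# Matrix multiplication: rows of A combined with columns of B
matprod <- A %*% B
#      [,1] [,2]
# [1,]   23   31
# [2,]   34   46
```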
Lists are more versatile containers in that they allow you to store different kinds of data in them.
By default, they are numerically indexed.
Notice that lists use two sets of square brackets instead of one, to differentiate them from a normal vector.
This is because a single set of brackets returns a list containing the requested element, whereas the double brackets reach inside that list and return the element itself.
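The difference between single and double brackets shows up when you check what each one returns. A minimal sketch with a hypothetical list named stuff:

```r
stuff <- list("apple", 42, TRUE)

# Single brackets return a list of length 1
piece <- stuff[1]
class(piece)    # "list"

# Double brackets return the element stored in that slot
value <- stuff[[1]]
class(value)    # "character"
```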
Lists can be made more friendly by using actual names for the keys associated with each value. In some languages, like Python, these are referred to as dictionaries.
Notice the use of the $ in the output.
This $ notation is used to easily grab the contents of the list at that slot.
As well as to add new entries to the list directly.
You can also use the double brackets AND the name of the key as a reference.
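The three ways of working with named entries can be sketched as follows. The list name and its contents are hypothetical:

```r
# A hypothetical named list
theList <- list(x = 1, y = "banana")

# Grab the contents of a slot with $
x_value <- theList$x          # 1

# Add a new entry directly with $
theList$z <- c(2, 3)

# Use double brackets with the key name
y_value <- theList[["y"]]     # "banana"

names(theList)                # "x" "y" "z"
```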
However, this is more work and looks a bit less elegant than the $ notation. Also, if you look at the order of operations, you’ll see that the $ notation has a higher precedence than the single or double brackets (see ?Syntax).
In R, you will most likely work with list objects as analysis results rather than as a container to keep your data. Almost all analyses return their values as a list with the included components. Here is an example.
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Here is a quick correlation between the sepal and petal lengths in the iris data set.
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538
Values
statistic.t 21.6460193457598
parameter.df 148
p.value 1.03866741944978e-47
estimate.cor 0.871753775886583
null.value.correlation 0
alternative two.sided
method Pearson's product-moment correlation
data.name iris$Sepal.Length and iris$Petal.Length
conf.int1 0.827036329664362
conf.int2 0.905508048821454
Printing the results shows the components of the analysis in a way that makes sense because, while it is a list, it also has its own class (here "htest") that controls how it is displayed.
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538
[1] "htest"
This is awesome because it makes it much easier to use something like the values stored in iris.test to insert the data from our analyses directly (inline) into our text.
There was a significant relationship between sepal and petal length (Pearson’s product-moment correlation, \(\rho =\) 0.872, \(t =\) 21.6, P = 1.04e-47).
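A sketch of how the values in that sentence could be pulled out of iris.test. The rounding and formatting choices here are assumptions chosen to match the numbers shown above; in an R Markdown or Quarto document such expressions would typically be placed in inline code.

```r
# The iris data set ships with base R
iris.test <- cor.test(iris$Sepal.Length, iris$Petal.Length)

class(iris.test)    # "htest"
names(iris.test)    # the components stored in the list

# Pull out individual components and format them for reporting
r_value <- round(unname(iris.test$estimate), 3)      # correlation estimate
t_value <- round(unname(iris.test$statistic), 1)     # t statistic
p_value <- format(iris.test$p.value, digits = 3)     # P-value as text
```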
The main container that will hold almost all of your data is the data.frame.
Each column holds a single kind of data (e.g., weight, longitude, survived). Let’s consider the following data as individual vectors.
These can be put into a data.frame as:
Each column in a data.frame is a self-contained set of data all of the same type and as such can be summarized.
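A minimal sketch of building and summarizing a data.frame. The vector names and every value below are hypothetical and stand in for the original data:

```r
# Hypothetical vectors; names and values are illustrative only
weight    <- c(12.1, 9.8, 15.3)
longitude <- c(-110.5, -111.2, -109.9)
survived  <- c(TRUE, FALSE, TRUE)

# Each vector becomes a column, named after the vector
df <- data.frame(weight, longitude, survived)

# Each column is summarized according to its own type
summary(df)
```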
Just like in a list, the columns of a data.frame are accessed by their names, and we can use the $ notation.
The easiest way to index values in a data.frame is to use the $ notation to grab the column (as a vector object) and then to use the square brackets to access a specific element.
You can also use the numerical indices for both row and column in the data.frame (n.b., it is row first then column).
The dimensions of a data.frame (its number of rows and columns) are often relevant as well.
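The indexing styles and size queries described above can be sketched on a small hypothetical data.frame (column names and values are made up):

```r
# A hypothetical data.frame with three rows and two columns
df <- data.frame(
  weight   = c(12.1, 9.8, 15.3),
  survived = c(TRUE, FALSE, TRUE)
)

# $ grabs the column as a vector, then [ ] picks an element
by_name <- df$weight[2]       # 9.8

# Numerical indices: row first, then column
by_index <- df[2, 1]          # 9.8

# Size of the data.frame
dim(df)                       # 3 2 (rows, columns)
n_rows <- nrow(df)            # 3
n_cols <- ncol(df)            # 2
```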
You will almost never create data.frame objects de novo but instead load data in from some external resource. There are several functions that simplify this within tidyverse so let’s make sure we have it loaded into memory.
Here is a CSV file that is contained in this repository. Since it is a public repository, we can access it from within GitHub using a URL.
library( tidyverse )
url <- "https://raw.githubusercontent.com/DyerlabTeaching/Data-Containers/main/data/arapat.csv"
beetles <- read_csv( url )
Rows: 39 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Stratum
dbl (2): Longitude, Latitude
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.